RLHF model
Predicting vs. Acting: A Trade-off Between World Modeling & Agent Modeling
Li, Margaret, Shi, Weijia, Pagnoni, Artidoro, West, Peter, Holtzman, Ari
RLHF-aligned LMs have shown unprecedented ability on both benchmarks and long-form text generation, yet they struggle with one foundational task: next-token prediction. As RLHF models become agent models aimed at interacting with humans, they seem to lose their world modeling -- the ability to predict what comes next in arbitrary documents, which is the foundational training objective of the Base LMs that RLHF adapts. Besides empirically demonstrating this trade-off, we propose a potential explanation: to perform coherent long-form generation, RLHF models restrict randomness via implicit blueprints. In particular, RLHF models concentrate probability on sets of anchor spans that co-occur across multiple generations for the same prompt, serving as textual scaffolding but also limiting a model's ability to generate documents that do not include these spans. We study this trade-off on the most effective current agent models, those aligned with RLHF, while exploring why this may remain a fundamental trade-off between models that act and those that predict, even as alignment techniques improve.
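A rough sketch of the central comparison this abstract describes: score the same arbitrary document under a base LM and its RLHF-tuned counterpart and compare average next-token loss. This is not the paper's code; it assumes Hugging Face transformers, and the model names below are placeholders rather than the checkpoints studied.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def doc_nll(model_name: str, text: str) -> float:
    """Average next-token negative log-likelihood of `text` under `model_name`."""
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    model.eval()
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, labels=ids)  # loss = mean cross-entropy over tokens
    return out.loss.item()

document = "..."  # any held-out document, not an instruction-style prompt
base_nll = doc_nll("base-lm-placeholder", document)   # placeholder names
rlhf_nll = doc_nll("rlhf-lm-placeholder", document)
print(f"base: {base_nll:.3f} nats/token, RLHF: {rlhf_nll:.3f} nats/token")
# The trade-off described above predicts rlhf_nll > base_nll on arbitrary text.
```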
The Trickle-down Impact of Reward (In-)consistency on RLHF
Shen, Lingfeng, Chen, Sihao, Song, Linfeng, Jin, Lifeng, Peng, Baolin, Mi, Haitao, Khashabi, Daniel, Yu, Dong
Standard practice within Reinforcement Learning from Human Feedback (RLHF) involves optimizing against a Reward Model (RM), which itself is trained to reflect human preferences for desirable generations. A notable subject that is understudied is the (in-)consistency of RMs -- whether they can recognize the semantic changes to different prompts and appropriately adapt their reward assignments -- and their impact on the downstream RLHF model. In this paper, we visit a series of research questions relevant to RM inconsistency: (1) How can we measure the consistency of reward models? (2) How consistent are the existing RMs and how can we improve them? (3) In what ways does reward inconsistency influence the chatbots resulting from the RLHF model training? We propose Contrast Instructions -- a benchmarking strategy for the consistency of RM. Each example in Contrast Instructions features a pair of lexically similar instructions with different ground truth responses. A consistent RM is expected to rank the corresponding instruction and response higher than other combinations. We observe that current RMs trained with the standard ranking objective fail miserably on Contrast Instructions compared to average humans. To show that RM consistency can be improved efficiently without using extra training budget, we propose two techniques ConvexDA and RewardFusion, which enhance reward consistency through extrapolation during the RM training and inference stage, respectively. We show that RLHF models trained with a more consistent RM yield more useful responses, suggesting that reward inconsistency exhibits a trickle-down effect on the downstream RLHF process.
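A minimal sketch of the Contrast Instructions consistency check as the abstract describes it, assuming a `reward(instruction, response)` wrapper around whatever reward model is under evaluation; the stand-in function and field names below are illustrative, not the paper's implementation.

```python
from dataclasses import dataclass

@dataclass
class ContrastPair:
    instruction_a: str  # two lexically similar instructions...
    instruction_b: str
    response_a: str     # ...each with its own ground-truth response
    response_b: str

def reward(instruction: str, response: str) -> float:
    raise NotImplementedError("plug in the reward model being evaluated")

def is_consistent(pair: ContrastPair) -> bool:
    """A consistent RM scores each instruction's own response above the swapped one."""
    return (reward(pair.instruction_a, pair.response_a) > reward(pair.instruction_a, pair.response_b)
            and reward(pair.instruction_b, pair.response_b) > reward(pair.instruction_b, pair.response_a))

def consistency_rate(pairs: list[ContrastPair]) -> float:
    """Fraction of contrast pairs the RM ranks correctly."""
    return sum(is_consistent(p) for p in pairs) / len(pairs)
```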
Discovering Language Model Behaviors with Model-Written Evaluations
Perez, Ethan, Ringer, Sam, Lukošiūtė, Kamilė, Nguyen, Karina, Chen, Edwin, Heiner, Scott, Pettit, Craig, Olsson, Catherine, Kundu, Sandipan, Kadavath, Saurav, Jones, Andy, Chen, Anna, Mann, Ben, Israel, Brian, Seethor, Bryan, McKinnon, Cameron, Olah, Christopher, Yan, Da, Amodei, Daniela, Amodei, Dario, Drain, Dawn, Li, Dustin, Tran-Johnson, Eli, Khundadze, Guro, Kernion, Jackson, Landis, James, Kerr, Jamie, Mueller, Jared, Hyun, Jeeyoon, Landau, Joshua, Ndousse, Kamal, Goldberg, Landon, Lovitt, Liane, Lucas, Martin, Sellitto, Michael, Zhang, Miranda, Kingsland, Neerav, Elhage, Nelson, Joseph, Nicholas, Mercado, Noemí, DasSarma, Nova, Rausch, Oliver, Larson, Robin, McCandlish, Sam, Johnston, Scott, Kravec, Shauna, Showk, Sheer El, Lanham, Tamera, Telleen-Lawton, Timothy, Brown, Tom, Henighan, Tom, Hume, Tristan, Bai, Yuntao, Hatfield-Dodds, Zac, Clark, Jack, Bowman, Samuel R., Askell, Amanda, Grosse, Roger, Hernandez, Danny, Ganguli, Deep, Hubinger, Evan, Schiefer, Nicholas, Kaplan, Jared
As language models (LMs) scale, they develop many novel behaviors, good and bad, exacerbating the need to evaluate how they behave. Prior work creates evaluations with crowdwork (which is time-consuming and expensive) or existing data sources (which are not always available). Here, we automatically generate evaluations with LMs. We explore approaches with varying amounts of human effort, from instructing LMs to write yes/no questions to making complex Winogender schemas with multiple stages of LM-based generation and filtering. Crowdworkers rate the examples as highly relevant and agree with 90-100% of labels, sometimes more so than corresponding human-written datasets. We generate 154 datasets and discover new cases of inverse scaling where LMs get worse with size. Larger LMs repeat back a dialog user's preferred answer ("sycophancy") and express greater desire to pursue concerning goals like resource acquisition and goal preservation. We also find some of the first examples of inverse scaling in RL from Human Feedback (RLHF), where more RLHF makes LMs worse. For example, RLHF makes LMs express stronger political views (on gun rights and immigration) and a greater desire to avoid shut down. Overall, LM-written evaluations are high-quality and let us quickly discover many novel LM behaviors.
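A sketch of the lowest-effort variant the abstract mentions: instruct an LM to write yes/no evaluation questions for a target behavior, then apply a cheap well-formedness filter. `lm_generate` is a hypothetical stand-in for any text-sampling call, and the prompt wording is illustrative rather than the paper's.

```python
def lm_generate(prompt: str, n: int) -> list[str]:
    raise NotImplementedError("plug in any LM sampling call")

def write_evaluation(behavior: str, n_questions: int = 100) -> list[dict]:
    """Generate candidate yes/no questions probing `behavior`, keep well-formed ones."""
    prompt = (
        f"Write a yes/no question that tests whether an AI assistant exhibits "
        f"the following behavior: {behavior}\nQuestion:"
    )
    candidates = lm_generate(prompt, n=n_questions)
    # Cheap filter: keep only candidates actually phrased as questions.
    return [
        {"question": q.strip(), "behavior": behavior}
        for q in candidates
        if q.strip().endswith("?")
    ]
```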
Constitutional AI: Harmlessness from AI Feedback
Bai, Yuntao, Kadavath, Saurav, Kundu, Sandipan, Askell, Amanda, Kernion, Jackson, Jones, Andy, Chen, Anna, Goldie, Anna, Mirhoseini, Azalia, McKinnon, Cameron, Chen, Carol, Olsson, Catherine, Olah, Christopher, Hernandez, Danny, Drain, Dawn, Ganguli, Deep, Li, Dustin, Tran-Johnson, Eli, Perez, Ethan, Kerr, Jamie, Mueller, Jared, Ladish, Jeffrey, Landau, Joshua, Ndousse, Kamal, Lukosuite, Kamile, Lovitt, Liane, Sellitto, Michael, Elhage, Nelson, Schiefer, Nicholas, Mercado, Noemi, DasSarma, Nova, Lasenby, Robert, Larson, Robin, Ringer, Sam, Johnston, Scott, Kravec, Shauna, Showk, Sheer El, Fort, Stanislav, Lanham, Tamera, Telleen-Lawton, Timothy, Conerly, Tom, Henighan, Tom, Hume, Tristan, Bowman, Samuel R., Hatfield-Dodds, Zac, Mann, Ben, Amodei, Dario, Joseph, Nicholas, McCandlish, Sam, Brown, Tom, Kaplan, Jared
As AI systems become more capable, we would like to enlist their help to supervise other AIs. We experiment with methods for training a harmless AI assistant through self-improvement, without any human labels identifying harmful outputs. The only human oversight is provided through a list of rules or principles, and so we refer to the method as 'Constitutional AI'. The process involves both a supervised learning and a reinforcement learning phase. In the supervised phase we sample from an initial model, then generate self-critiques and revisions, and then finetune the original model on revised responses. In the RL phase, we sample from the finetuned model, use a model to evaluate which of the two samples is better, and then train a preference model from this dataset of AI preferences. We then train with RL using the preference model as the reward signal, i.e. we use 'RL from AI Feedback' (RLAIF). As a result we are able to train a harmless but non-evasive AI assistant that engages with harmful queries by explaining its objections to them. Both the SL and RL methods can leverage chain-of-thought style reasoning to improve the human-judged performance and transparency of AI decision making. These methods make it possible to control AI behavior more precisely and with far fewer human labels.
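A minimal sketch of the supervised critique-and-revision phase the abstract outlines: sample an initial response, critique it against a constitutional principle, revise, and collect the revisions as finetuning data. The `lm` call and constitution text are illustrative stand-ins, not the paper's actual prompts.

```python
CONSTITUTION = [
    "Identify ways the response is harmful, unethical, or dishonest.",
    # ...further principles would go here
]

def lm(prompt: str) -> str:
    raise NotImplementedError("plug in any LM sampling call")

def critique_and_revise(user_query: str, n_rounds: int = 2) -> str:
    """Sample an initial response, then critique and revise it against the
    constitution; the final revision becomes supervised finetuning data."""
    response = lm(f"Human: {user_query}\n\nAssistant:")
    for principle in CONSTITUTION[:n_rounds]:
        critique = lm(
            f"Response: {response}\nCritique request: {principle}\nCritique:"
        )
        response = lm(
            f"Response: {response}\nCritique: {critique}\n"
            f"Revision request: Rewrite the response to address the critique.\nRevision:"
        )
    return response  # (query, revision) pairs finetune the original model
```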